ABSTRACT

Large spectroscopic surveys require automated methods of analysis. This paper explores the use 
of k-means clustering as a tool for automated unsupervised classification of massive stellar spectral 
catalogs. The classification criteria are defined by the data and the algorithm, with no prior physical 
framework. We work with a representative set of stellar spectra associated with the SDSS SEGUE 
and SEGUE-2 programs, which consists of 173,390 spectra from 3800 to 9200 Å sampled on 3849
wavelengths. We classify the original spectra as well as the spectra with the continuum removed. The 
second set only contains spectral lines, and it is less dependent on uncertainties of the flux calibration. 
The classification of the spectra with continuum renders 16 major classes. Roughly speaking, stars 
are split according to their colors, with enough finesse to distinguish dwarfs from giants of the same 
effective temperature, but with difficulties to separate stars with different metallicities. There are 
classes corresponding to particular MK types, intrinsically blue stars dust-reddened, stellar systems, 
and also classes collecting faulty spectra. Overall, there is no one-to-one correspondence between the 
classes we derive and the MK types. The classification of spectra without continuum renders 13 classes;
the color separation is not so sharp, but it distinguishes stars of the same effective temperature and
different metallicities. Some classes thus obtained present a fairly small range of physical parameters 
(200 K in effective temperature, 0.25 dex in surface gravity, and 0.35 dex in metallicity), so that
the classification can be used to estimate the main physical parameters of some stars at a minimum 
computational cost. We also analyze the outliers of the classification. Most of them turn out to be 
failures of the reduction pipeline, but there are also high redshift QSOs, multiple stellar systems,
dust-reddened stars, galaxies, and, finally, odd spectra whose nature we have not deciphered. The template
spectra representative of the classes are publicly available (ftp://stars:kmeans@ftp.iac.es/).

Subject headings: methods: data analysis - methods: statistical - astronomical databases: miscella- 
neous - stars: fundamental parameters - stars: general 
1. INTRODUCTION

Stellar spectra contain a wealth of information on the 
photospheres of stars, including their chemical makeup. 
In spite of decades of measuring and studying stellar 
spectra, we have a very limited knowledge of how many 
stars with a given chemical composition exist in the 
Galaxy. Large spectroscopic surveys such as RAVE
(Steinmetz et al. 2006) or SDSS (e.g., Stoughton et al.
2002) have increased by several orders of magnitude the
number of stars with measured spectra, yet these surveys 
are limited in scope, reaching only a particular cross- 
section of the stellar population of the Milky Way, and 
biased, in that they target stars that have been prese- 
lected based on their color, distance or brightness. This 
situation will soon change as new large projects such as
Gaia (e.g., Turon et al. 2005) or HETDEX (Hill et al.
2008), with well controlled samples, appear on the scene.
The advent of these new data sets poses an obvious prob- 
lem. As the data flow continuously increases, performing 
systematic studies becomes more of an issue, and new ef- 
ficient means of analyzing stellar spectra are needed. 

The study of stellar spectra to quantify the physical 
properties of stars requires model atmospheres, radiative 
transfer calculations, atomic and molecular data, and involves
a large number of approximations. The model
ingredients and our recipes to apply them to interpret 
observations are in constant evolution, and so are the 
values for the inferred quantities. Avoiding such changes 
is one of the motivations behind spectroscopic
classification in general, and in particular of the MK system
of Morgan et al. (1943) and Morgan & Keenan (1973),
which is still in use today enhanced with extensions. It
assigns spectral classes in a way that is purely empirical 
and repeatable, providing basic information as a prelim- 
inary step for further, more detailed, analysis. 

This kind of typing system based on a series of pre- 
defined criteria is feasible as long as the criteria are 
set and applied by humans. Such systems resemble 
the taxonomical classification of animal species (see
Fraix-Burnet et al. 2006) and, by definition, are not
optimal since they rely on subjective judgments. If such
systems have to be updated for application to very large
numbers of spectra, the classification has to be easily
implemented and performed with computers, in a fast and
homogeneous fashion. Moreover, Sandage (2005)
maintained that physics must not be used to drive a
classification; otherwise the arguments become circular when
using the classification to support physics. Thus, one is 
inclined to consider unsupervised classification systems, 
which are themselves defined by algorithms and data. 
The potential of one such method, k-means, is
explored in this paper. However, it must be stressed that
unsupervised does not mean absolute or model indepen- 
dent. The classification criteria are implicitly set by the 
algorithm, and the resulting classes depend on the spe- 
cific dataset under analysis. 

In data mining parlance, the spectrum of a star 
is a point in the high-dimensional space where each 
coordinate corresponds to the intensity at a particular 
wavelength. Given a comprehensive set of stellar 
spectra, classifying consists of identifying clusters in 
this high-dimensional space. The problem of finding 
structures in a multidimensional data set also goes by
the name of cluster analysis (see, e.g., Everitt 1995;
Bishop 2006). One of the most widely used algorithms
is k-means clustering (MacQueen 1967), and it fulfills
the requirements put forward above. Moreover, k-means 
is simple to code and robust, even when exploring clus- 
tering in a high-dimensional space. Previous works have 
shown successful applications of the method to the
classification of spectra in various astrophysical contexts,
e.g., stars (Balazs et al. 1996; Simpson et al. 2012),
solar polarization spectra (Sanchez Almeida & Lites
2000; Viticchie & Sanchez Almeida 2011), X-ray
spectra (Hojnacki et al. 2007), spectra from
asteroids (Galluccio et al. 2008), and galaxy spectra
(Sanchez Almeida et al. 2009, 2010; Morales-Luis et al.
2011).

We now explore its application to medium-resolution 
stellar spectra from the Sloan Digital Sky Survey. SDSS 
currently provides the largest available homogeneous
database of stellar spectra. The original SDSS survey
(Stoughton et al. 2002; Abazajian et al. 2009), together
with SEGUE and SEGUE-2, contains somewhat
over half a million stellar spectra (Yanny et al. 2009).
The Baryon Oscillation Spectroscopic Survey (BOSS;
Eisenstein et al. 2011), part of SDSS-III, uses similar but
upgraded spectrographs, and in the first two years of 
operation has already obtained over 100,000 additional 
spectra of stars. Even though these stars do not pro- 
vide a fair sample of the Milky Way stellar population, 
this rich data set is a good place to explore the
application of clustering algorithms for classifying stars.
Actually, the set has already been used for this purpose.
McGurk et al. (2010) apply principal component
analysis (PCA) to spectra having narrow color bins, so as
to separate stars of the same effective temperature
according to their gravity and metallicity. In this case the
spectra are not classified directly by the automated
procedure, but the color cuts introduce human supervision
into the classification. Daniel et al. (2011) apply local
linear embedding (LLE), which is a type of PCA
decomposition that preserves the nonlinear structure within
high-dimensional data sets. The stellar spectra are found
to form a 1D family when projected onto the first three
eigenvectors. Finally, Asensio Ramos & Allende Prieto
(2010) also use SDSS data to show that stellar spectra 
are highly compressible, so that a small number of pa- 
rameters suffice to reproduce the bulk of the observed 
spectra. 

In this paper we focus on the k-means classification 
of stars observed as part of the SEGUE and SEGUE-2
surveys (Yanny et al. 2009). Section 2 presents the
selection of spectra used in the analysis, and Sect. 3
describes the basics behind k-means. Section 4 applies the
algorithm to SDSS spectra, first considering continuum
(Sect. 4.1), and then without continuum (Sect. 4.2). The
classification allows us to identify outliers, often rare
objects that do not show up unless the catalogs are large
enough and which turn out to be extremely revealing
(e.g., Matijevic et al. 2012). The properties of these
outliers are analyzed in Sect. 5. Section 6 explains the main
results and outlines additional uses of the classification.

2. DATA SET 

The spectra come from SEGUE and SEGUE-2
(Yanny et al. 2009). These programs obtained stellar
spectra using the SDSS 2.5-m telescope and the SDSS
double spectrograph (Gunn et al. 2006; Smee et al.
2012) between 2004 and 2009. The spectra contain
3849 wavelengths covering the range 3800-9200 Å at a
resolving power R ~ 1800. They were downloaded from
the SDSS Data Release 8 (DR8; Aihara et al. 2011).
SEGUE observed numerous types of targets, from very 
hot WD to very cool M and L types, each chosen based on 
color criteria, and in some cases additional information 
such as proper motion. The survey observations sample 
the Galaxy at mid and high galactic latitudes, covering 
very sparsely 3/4 of the sky, with only 3 plates (less than 
0.5% of the spectra) at |b| < 10°.

The sample of DR8 spectra associated with the SEGUE
and SEGUE-2 programs includes 355,840 spectra in
525 plug-plates. Each plate, as for SDSS, admits
640 fibers. We selected stars with a median
signal-to-noise ratio S/N > 10, a radial velocity modulus
smaller than 600 km s$^{-1}$ (redshift < 0.002), and which
were not labeled as one of the following classes of
objects: GALAXY, NA, QA, ROSAT_D, QSO, SER, and
CATY_VAR.

We processed the spectra to eliminate by
interpolation the [O I] nightglow lines at 5577 and 6300 Å, and
corrected the Doppler shifts associated with the radial
velocities. The original spectra have units of flux per
unit wavelength (i.e., erg cm$^{-2}$ s$^{-1}$ Å$^{-1}$). We normalized
them by dividing by the median value of their fluxes in the
spectral band between 5000 and 6000 Å. This step
preserves the shape of the spectral energy distribution, while
placing all the spectra on the same scale regardless of the
intrinsic luminosity of the stars, their distance, and the 
amount of interstellar absorption. Missing sections of 
spectra were patched by interpolation, since any regions 
with extreme (wrong) values can damage the
classification algorithm. A sample of 173,390 stellar spectra passed
all the selection criteria, and we refer to them as the
reference set. Interstellar extinction is not corrected for;
however, we analyze a version of the same data with a
running mean subtracted from the spectral energy distri- 
bution, leaving only absorption features. This procedure 
is intended to partly remove the reddening produced by 
interstellar extinction, thus minimizing its potential im- 
pact on the results. 
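
The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names, the toy power-law spectrum, and the half-width of 5 Å used to mask each sky line are assumptions made only for this example.

```python
import numpy as np

def patch_by_interpolation(wave, flux, bad):
    """Replace flagged pixels by linear interpolation between good neighbors."""
    flux = np.asarray(flux, dtype=float).copy()
    good = ~bad
    flux[bad] = np.interp(wave[bad], wave[good], flux[good])
    return flux

def normalize_spectrum(wave, flux, band=(5000.0, 6000.0)):
    """Divide by the median flux in the band (wavelengths in Angstrom)."""
    in_band = (wave >= band[0]) & (wave <= band[1])
    return flux / np.median(flux[in_band])

# Toy spectrum on a log-spaced grid with the size quoted in the text.
wave = np.logspace(np.log10(3800.0), np.log10(9200.0), 3849)
flux = 2.0e-16 * (wave / 5500.0) ** -1.5
flux[1000:1010] = np.nan                      # a missing section

# Flag missing pixels plus a (hypothetical) +-5 A window around each sky line.
bad = ~np.isfinite(flux)
for line in (5577.0, 6300.0):
    bad |= np.abs(wave - line) < 5.0

clean = normalize_spectrum(wave, patch_by_interpolation(wave, flux, bad))
```

Dividing by a median rather than a mean keeps the normalization insensitive to a few deviant pixels in the band.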

Two additional sets of spectra are mentioned in the pa- 
per. They were used only for testing in early stages of the 
work, but they are explicitly mentioned here because the 
sanity checks performed with them provide confidence
in the technique. As far as we can tell, they are
equivalent to the reference set in a statistical sense. For lack
of imagination, we refer to them as 1st auxiliary set and
2nd auxiliary set. Auxiliary set 1 comprises 63,611 stellar
spectra from 3800 to 8000 Å, drawn from SDSS/DR6 and
then purged to retain only those with the best S/N. They are
uniformly resampled in log-wavelength so as to have 1617
wavelengths. This set is particularly rich in main
sequence F-type stars, since they were used as
spectrophotometric calibrators in the original SDSS survey (only
162 distinct SEGUE plates were included in DR6).
Auxiliary set 2 comes from DR8, but the noise thresholding
and other selection criteria differ from the reference set,
and leave only 121,272 targets.

Effective temperatures ($T_{\rm eff}$), surface gravities (log g),
and metallicities ([Fe/H]) computed with the SEGUE
Stellar Parameter Pipeline (hereafter SSPP; Lee et al.
2008a,b; Allende Prieto et al. 2008; Lee et al. 2011;
Smolinski et al. 2011) are used here to characterize the
physical properties of the stars. Each one of these
physical parameters results from the robust average of
various independent estimates, discarding those which do not
seem to be consistent (for details, see Lee et al. 2008a).
The MK types of the targets mentioned in the paper are
also from the SSPP. Most of our discussion is based on
the so-called ELODIE MK types, derived by best-fitting
templates from the ELODIE library of synthetic
spectra (Prugniel & Soubiran 2001). The distribution of MK
types of the reference set is shown in Fig. 1. These types
encompass a broad range of stellar properties from O to
L and include WDs; however, they face some
difficulties when interpreting intermediate spectral types (e.g.,
the distribution of MK types in Fig. 1 peaks at type F
whereas most of the SEGUE stars have been selected
to be of G type; see Yanny et al. 2009). The SSPP
also provides a second set of MK types only for cool
stars, the Hammer MK types, which were inferred using
the spectral typing software developed and described by
Covey et al. (2007). They cope much better with types
G and F, and we apply them when appropriate.

3. CLASSIFICATION ALGORITHM 

We use k-means to carry out the classification, which is
a robust tool commonly used in data mining and artificial
intelligence (e.g., Everitt 1995; Bradley & Fayyad 1998).
The actual realization of the algorithm employed in our
analysis is described in detail by Sanchez Almeida et al.
(2010, § 2), and we refer to that work for details.
However, for completeness, this section sketches the
operation of the algorithm, with its pros and cons.

As for most classification algorithms, the stellar spec- 
tra are vectors in a high dimensional linear space, with as 
many dimensions as the number of wavelengths. There- 
fore the spectra to be classified are a set of points in 
this space. The points (i.e., the spectra) are assumed 
to be clustered around a number of centers. Classifying 
consists of (a) finding the number of clusters and their 
centers, (b) assigning each spectrum to one of these cen- 
ters, and (c) estimating the probability that the choice 
is correct. The third step should be regarded as a sanity 
check that allows us to quantify the goodness of the clas- 
sification for each particular spectrum. In the standard 
formulation, k-means starts by selecting at random from 
the full data set a number k of spectra. These spectra 
are assumed to be the center of a cluster. Then each 
spectrum of the data set is assigned to the cluster center
that is closest in a least squares sense (see footnote 3). Once all spectra
have been assigned to one of the classes, the cluster cen- 
ters are recomputed as the average of the spectra in the 
cluster. The procedure is iterated with the new cluster 
centers, quitting when most spectra are not re-assigned 
in two successive steps (99% of them in our realization).
The number of clusters is arbitrarily chosen, but the
results are insensitive to this choice since only a few
clusters possess a significant number of members. Thus
the algorithm provides the number of clusters, their cor- 
responding cluster centers, as well as the classification of 
all the original spectra now assigned to one of the clus- 
ters. This information completes steps (a) and (b) of the 
classification procedure. In order to estimate the
probability that the assignment is correct (step c), we compute
for each cluster the distribution of the distances to the 
cluster center considering all spectra assigned to the clus- 
ter. We then assume that this distribution describes the 
probability that any star with a given distance from the 
cluster center belongs to the class. Specifically, the prob- 
ability that a given star belongs to a cluster is estimated 
as the fraction of stars in the cluster with distances equal 
to or larger than the distance of the star. It is a sensible 
assumption; it gives high probability to spectra close to 
the cluster center, and then drops down smoothly toward 
the outskirts of the cluster. The scale of this smooth de- 
crease is provided by the measured distribution of dis- 
tances in the class. 
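
Steps (a) and (b) of the standard formulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a plain uniform random initialization (not the refined seeding of the actual implementation), and the function name, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def kmeans(spectra, k, frac_stable=0.99, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance and equal wavelength weights.

    spectra has shape (n_spectra, n_wavelengths). Returns (labels, centers).
    """
    rng = np.random.default_rng(seed)
    n = spectra.shape[0]
    # Initial centers: k spectra drawn at random from the data set.
    centers = spectra[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    sq_norm = np.einsum("ij,ij->i", spectra, spectra)
    for _ in range(max_iter):
        # Squared distances from |x|^2 - 2 x.c + |c|^2, an (n, k) array.
        d2 = (sq_norm[:, None] - 2.0 * spectra @ centers.T
              + np.einsum("ij,ij->i", centers, centers)[None, :])
        new_labels = np.argmin(d2, axis=1)
        unchanged = np.mean(new_labels == labels)
        labels = new_labels
        # New center = average of the members; empty clusters keep theirs.
        for j in range(k):
            members = labels == j
            if members.any():
                centers[j] = spectra[members].mean(axis=0)
        # Stop when at least frac_stable of the spectra were not re-assigned.
        if unchanged >= frac_stable:
            break
    return labels, centers

# Toy data: two well separated clumps of 20-pixel "spectra".
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(1.0, 0.05, (50, 20)),
                  rng.normal(2.0, 0.05, (50, 20))])
labels, centers = kmeans(data, k=2)
```

The stopping rule mirrors the 99% criterion quoted in the text; a stricter threshold would simply add iterations.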

The algorithm is simple, robust and fast, which makes 
it ideal to treat large data sets. It guarantees that similar 
spectra end up in the same cluster. Moreover, it is unsu- 
pervised since no prior knowledge of the stellar properties 
is used, and the spectra to be classified are the only
information passed on to the algorithm (see footnote 4). These two
properties ensure that the resulting classification is not biased
by our (physical) prejudices, which follows the spirit of
a good classification as advocated by Sandage (2005).
Unfortunately, it also has three major drawbacks. One 
of them is technical, whereas the other two deal with 
the physical interpretation of the classes. The algorithm 
yields different classifications depending on the random 
initialization. This difficulty is overcome by repeating 
the classification multiple times, thus studying the de- 
pendence of the final classes on the random seeds. In 
addition, our implementation refines the initialization so 
that the random seeds are not chosen uniformly but
according to the distribution of points in the classification
space (for details, see Sanchez Almeida et al. 2010). The
second difficulty has to do with interpreting the classes 
as actual clusters in the classification space, or as parts of 
larger structures. The algorithm does not guarantee that 
the derived classes correspond to actual clusters. How- 
ever, one can figure out whether each particular class is 
isolated or belongs to a larger structure by studying the 
distances of the spectra in the class to the other classes. 



3 This means using the Euclidean metric to assign distances be- 
tween points in the high dimensional classification space. Actually, 
we use the plain Euclidean distance, where all the wavelengths are
equally weighted (Sanchez Almeida et al. 2010, Eq. (2)). Observa- 
tional errors are not included in the metric for simplicity. 

4 For the sake of comparison, artificial neural network
classifications use a training set that informs the algorithm on the
existing spectra and spectral types; see, e.g., Navarro et al. (2012)
and references therein.
Fig. 1. — Distribution of MK spectral types in the reference dataset. We use ELODIE-based types as provided by the SSPP.



Well defined classes contain stars that are distant from 
the other classes. The third difficulty refers to the phys- 
ical interpretation of the resulting classes, which is not 
provided by the algorithm. The physical sense of a par- 
ticular class and its cluster center (dubbed in the paper 
as template spectrum) has to be figured out later on. Ac- 
tually, most of this paper is devoted to this task, i.e., to 
interpreting in terms of known stellar physics the classes 
resulting from the k-means classification. 

3.1. Repeatability of the classification 

As customary, the dependence of the classification on 
the random initialization was studied by repeating the 
classification 100 times, and then comparing the results. 
This internal comparison was carried out using three pa- 
rameters that we name: (1) coincidence, for the percent- 
age of spectra in equivalent classes, (2) dispersion, for the 
rms fluctuations of the spectra in a class with respect to its
cluster center, and (3) number of classes, for the num- 
ber of classes that contain 99 % of the spectra (major 
classes). In order to decide which classes in two differ- 
ent classifications are equivalent, we compute the number 
of stars in common between each pair of classes formed 
by one class from one classification and the second class 
from the second classification. The two classes sharing 
the largest number of stars are assumed to be equivalent. 
The same criterion is repeated until all the classes of one 
of the classifications have been paired. This criterion 
maximizes the number of stars sharing the same class in
the two classifications.
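
The three diagnostics defined above can be sketched as follows. The pairing is the greedy procedure described in the text; the function names are hypothetical, and the dispersion is computed here as a single rms over all spectra, which is an assumption about how the per-class values are combined.

```python
import numpy as np

def pair_classes(labels_a, labels_b):
    """Greedy pairing: repeatedly match the two classes sharing most spectra."""
    ka, kb = labels_a.max() + 1, labels_b.max() + 1
    overlap = np.zeros((ka, kb), dtype=int)
    np.add.at(overlap, (labels_a, labels_b), 1)   # contingency table
    shared = overlap.astype(float)
    pairs = {}
    for _ in range(min(ka, kb)):
        i, j = np.unravel_index(np.argmax(shared), shared.shape)
        pairs[int(i)] = int(j)
        shared[i, :] = -1.0                        # drop both classes from the pool
        shared[:, j] = -1.0
    n_common = sum(overlap[i, j] for i, j in pairs.items())
    coincidence = 100.0 * n_common / labels_a.size
    return pairs, coincidence

def dispersion(spectra, labels, centers):
    """rms of the residuals between each spectrum and its cluster center."""
    return float(np.sqrt(np.mean((spectra - centers[labels]) ** 2)))

def n_major_classes(labels, fraction=0.99):
    """Smallest number of classes that together hold `fraction` of the spectra."""
    counts = np.sort(np.bincount(labels))[::-1]
    return int(np.searchsorted(np.cumsum(counts), fraction * labels.size) + 1)

# Toy example: two classifications of the same six spectra.
labels_a = np.array([0, 0, 0, 1, 1, 2])
labels_b = np.array([1, 1, 0, 0, 0, 2])
pairs, coincidence = pair_classes(labels_a, labels_b)
```

In the toy example five of the six spectra fall in paired classes, so the coincidence is about 83%.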

Figure 2 shows scatter plots of the three diagnostic
parameters corresponding to repeated classifications of the
reference dataset (§ 2) including continuum. One finds
classifications having between 15 and 20 major classes, a 
dispersion in the range 0.07 to 0.08, and a mean coinci- 
dence between 62 % and 68 %. The ranges in these values 
are fairly narrow. The fact that the coincidence is about 
65 % means that one can pair the classes of any two clas- 
sifications, and they will share about 65 % of the spectra. 
(These apparently low values are discussed below.) The 
fact that the dispersion is of the order of 0.075 implies 
that the differences between the class template spectra 
and the spectra in the class are of the order of 7.5 % rms. 
These numbers refer to the reference dataset including 
continua (§ 4.1) but are similar to those obtained when
spectra without continua are used (§ 4.2), or when using
the auxiliary sets. 

In addition to the above tests, we made a numerical ex- 
periment splitting the 1st auxiliary set into two randomly 
chosen disjoint subsets, which were classified
independently. The differences give an idea of the dependence
of the classes on the particular dataset that is employed.
The results are summarized in Fig. 3, which shows the
template spectra of equivalent classes in the two classi- 
fications. (Only nine classes are included, but they are 
representative of the general behavior.) The differences 
between spectra are also plotted, and turn out to be of 
a few percent, i.e., smaller than the scatter among the 
Fig. 2. — Scatter plots with the three parameters characterizing
the 100 independent classifications of the reference dataset with
continuum. (a) Percentage of stars common to all other
classifications (coincidence) versus typical scatter of the spectra with
respect to the class template (dispersion). (b) Number of major
classes (i.e., having 99% of the spectra) versus dispersion. (c)
Coincidence versus number of major classes.

spectra included in each class (the dispersion of these 
classifications was of the order of 5%). We have also 
computed the colors of the templates and of the indi- 
vidual stars, and the differences between the colors of 
the templates (~0.025 mag) are smaller than the scatter 
among individual stars in a class (~0.05 mag). 

As mentioned above, repeating the classification sev- 
eral times leaves only 60-70% of the spectra in equivalent 
classes. This issue with k-means is not directly due to 
the random initialization: the cluster centers in
equivalent classes are very similar, as shown in Fig. 3 and
in Sanchez Almeida et al. (2010, Sect. 2.1). It is due to
the fact that k-means slices a rather continuous
distribution. Then small changes in the borders between clusters
produce a significant relocation of the in-between stars.
The effect is boosted because the clusters are in a high
dimensional space (see the analytic justification in the
appendix of Sanchez Almeida et al. 2010). We show in
Sect. 4.1 how some of the resulting clusters really have
many stars near their borders.

The actual classifications were carried out in
parallel using several workstations. The reference data set
has a volume of 6.1 GB, and the process requires some
two CPU hours per single classification, or two hundred 
hours for studying the effects of the random initialization. 
These figures are mentioned to stress that the k-means
classification of a sizable data set, including studying ini- 
tial conditions, is easily doable using standard hardware 
facilities. 

4. THE CLASSIFICATION: CLASSES AND THEIR 
MAIN PROPERTIES 

The description given in this section refers to the
reference set defined in § 2; however, the same procedures
and analysis have been repeated with the two auxiliary 
sets, giving always consistent results. 

The classifications use all the 3849 wavelengths equally 
weighted. We consider two cases for the analysis. In the 
first case, the full spectra are used (§ 4.1). In the second,
a running average 193 pixels wide (=17,400 km s$^{-1}$) is
subtracted from each spectrum (§ 4.2). This high-pass
filtering removes the continuum but leaves the spectral 
lines almost untouched. 
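
The high-pass filtering described above can be sketched as a boxcar subtraction. The function name is hypothetical, and reflecting the spectrum at both ends so that the window is always full is an assumption of this sketch; the text does not say how the edges were treated.

```python
import numpy as np

def remove_continuum(flux, width=193):
    """Subtract a running mean `width` pixels wide (width must be odd)."""
    flux = np.asarray(flux, dtype=float)
    half = width // 2
    # Mirror the ends so that every pixel has a full window (an assumption).
    padded = np.pad(flux, half, mode="reflect")
    kernel = np.ones(width) / width
    smooth = np.convolve(padded, kernel, mode="valid")
    return flux - smooth

# Toy spectrum: flat continuum with a single one-pixel absorption line.
spectrum = np.ones(3849)
spectrum[2000] = 0.5
residual = remove_continuum(spectrum)
```

A line much narrower than the window barely changes the running mean, so its depth survives the subtraction almost intact while the smooth continuum is removed.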

4.1. Classes with continuum 

In order to study the dependence of the classes on the 
random seeds inherent to k-means, we carried out 100
independent classifications (§ 3.1). They are equally valid
classifications, but we have to choose one of them to
be used as the classification. Taking advantage of having
all these possibilities, we try to select the one that is
as representative as possible of all of them, whose spectra
within each class are as similar as possible, and that has the
smallest number of classes. In the parlance used in § 3.1,
we try to select a classification having large coincidence,
small dispersion and few classes. The scatter plots for
these three parameters among the 100 independent
classifications are shown in Fig. 2. Requiring the coincidence
to be larger than 66%, the dispersion to be smaller than
0.074, and the major classes to be fewer than 17, one
finds only two classifications. Those are represented as
asterisks in Fig. 2. For lack of a better criterion, we
choose one of them at random. Its coincidence is 67%,
its dispersion 0.073, and it has 16 major classes (26 in to- 
tal, but some seem to correspond to failures of the SDSS 
pipeline, as we explain later on). 
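
The selection just described amounts to three simultaneous cuts on the diagnostics of each run. A minimal sketch follows, with a hypothetical function name and made-up diagnostic values for three runs (they are not results from the paper).

```python
import numpy as np

def select_representative(coincidence, dispersion, n_major,
                          min_coincidence=66.0, max_dispersion=0.074,
                          max_classes=17):
    """Indices of the runs that pass the three cuts quoted in the text."""
    coincidence = np.asarray(coincidence, dtype=float)
    dispersion = np.asarray(dispersion, dtype=float)
    n_major = np.asarray(n_major)
    keep = ((coincidence > min_coincidence)
            & (dispersion < max_dispersion)
            & (n_major < max_classes))
    return np.flatnonzero(keep)

# Made-up diagnostics for three runs, for illustration only.
runs = select_representative([67.0, 65.0, 66.5],
                             [0.073, 0.072, 0.075],
                             [16, 15, 16])
```

When several runs survive, one of them can be drawn at random, as done in the text.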

The average of all the spectra in the classes (i.e., the 
cluster centers or the cluster templates) are shown as 
solid lines in Fig. 4. The figure also includes the
standard deviation among all the spectra in each class (the
dotted line), which quantifies the intra-class dispersion.
The number of stars in each class is represented as a
histogram in Fig. 5. It shows that the classes have been
numbered according to the number of stars they contain,
class 0 being the most numerous, class 1 the second most
numerous, and so on. Since the template
spectra come from averaging thousands of individual spectra,
they have extremely high signal-to-noise ratios, from
200 to 2000 depending on the number of spectra in the
class. The spectra of classes 22 and 24 are not included
in Fig. 4. They collect faulty spectra that are similar
to class 17 (see the template spectrum in Fig. 4, which
has a large unphysical spike at the bluest wavelength).

The templates are also represented in Fig. 6 as a
stack-plot ordered so that the image looks as smooth as 
Fig. 3. — Spectra of several equivalent classes resulting from the k-means classification of two disjoint subsets drawn from the 1st auxiliary
set. We show the two template spectra and their difference (shifted upward by 0.5 so as to fit in the plot); see the inset in the top left panel.
Wavelengths are given in μm.



possible. This image excludes those classes that are fail- 
ures of the pipeline (classes 17, 22 and 24) and binary 
systems (classes 20 and 21; see below). 

As we discuss in Sect. 3, k-means does not
guarantee the inferred classes to be real clusters in the
classification space. They may be parts of larger
structures that have been sliced by the algorithm. A way
to study whether the classes are isolated was explored in
Sanchez Almeida et al. (2010), and it is used here too.
One can estimate the probability that each particular 
star belongs to the class it was assigned to. It depends 
on how far from the cluster center the star is as compared 
to the other members in its class. Similarly, one can es- 
timate the probability of the star belonging to any other 
class. Well defined clusters will have most of their ele- 
ments with a probability of belonging to any other cluster 
significantly smaller than the probability of belonging to 
the cluster. Figure 7 shows histograms of the ratios
between probabilities of belonging to the 2nd nearest
cluster and to the assigned cluster for a few representative
classes. There are classes where the histogram peaks at
low ratios, thus indicating a well defined structure in the
classification space (e.g., class 0 in Fig. 7). Conversely,
other classes present a rather flat histogram indicating
a dispersed structure (e.g., class 3 in Fig. 7). Classes
3, 5, 11 and 19 represent spread-out classes, whereas
the rest are clustered classes. Note, however, that even 
the histograms of well defined clusters have a significant 
tail towards large ratios, indicating the presence of many 
stars in the boundaries between clusters. Those stars are 
partly responsible for the non-uniqueness of the
classification studied in Sect. 3.1.
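
The isolation test described in this paragraph can be sketched as follows. The membership probability is the fraction of class members lying at least as far from the class center as the star in question, and the "2nd nearest" class is taken here to be the nearest class other than the assigned one. Function names and toy data are hypothetical, and the sketch assumes that no class is empty.

```python
import numpy as np

def membership_probability(dist, member_dists):
    """Fraction of class members at a distance >= dist from the class center."""
    member_dists = np.sort(np.asarray(member_dists, dtype=float))
    n_closer = np.searchsorted(member_dists, dist, side="left")
    return 1.0 - n_closer / member_dists.size

def probability_ratios(spectra, labels, centers):
    """P(2nd nearest class) / P(assigned class) for every spectrum.

    The (n, k, m) temporary array is fine for a toy case but would be
    too large for a full survey.
    """
    n = spectra.shape[0]
    rows = np.arange(n)
    d = np.linalg.norm(spectra[:, None, :] - centers[None, :, :], axis=2)
    d_own = d[rows, labels]
    d_masked = d.copy()
    d_masked[rows, labels] = np.inf          # exclude the assigned class
    second = np.argmin(d_masked, axis=1)     # nearest of the remaining classes
    ratios = np.empty(n)
    for i in range(n):
        p_own = membership_probability(d_own[i], d_own[labels == labels[i]])
        p_2nd = membership_probability(d[i, second[i]],
                                       d_own[labels == second[i]])
        ratios[i] = p_2nd / p_own
    return ratios

# Toy data: two overlapping clumps in a 5-dimensional space.
rng = np.random.default_rng(0)
spectra = np.vstack([rng.normal(0.0, 1.0, (200, 5)),
                     rng.normal(1.5, 1.0, (200, 5))])
labels = np.repeat([0, 1], 200)
centers = np.array([spectra[labels == j].mean(axis=0) for j in range(2)])
ratios = probability_ratios(spectra, labels, centers)
```

A histogram of `ratios` restricted to one class then plays the role of one panel of Fig. 7: a peak at low ratios points to an isolated class, a flat histogram to a spread-out one.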

The physical interpretation of some of the classes is rel- 
atively straightforward. Classes 20 and 21, with upturns 
in both the blue and the red, are most likely compos- 
ite spectra of systems with two (or more) stars with very 
different effective temperatures ($T_{\rm eff}$). They can be
gravitationally bound stellar systems, or stars that happen
to be along the line of sight. The luminosity of the stars
that contribute to the combined spectrum has to be
similar; therefore, in the case of binary systems both stars
cannot be on the main sequence because the hot star would
outshine any cold companion. One common possibility
is a hot white dwarf (WD) and a cold dwarf or giant, 
and this is indeed the conclusion reached when trying to 
reproduce the templates of classes 20 and 21 as a linear 
superposition of templates of two other classes. The best 
fit is obtained combining classes 16 and 18, as shown in 
Fig. 8. As we discuss later on, class 16 contains WDs,
and class 18 corresponds to K-type giants. We note that 
Fig. 4. — Template spectra of all classes in the classification that includes continuum (the solid lines). The spectra are normalized to
the intensity at about 5500 Å, and the individual plots are scaled from minimum to maximum. The classes are identified in the insets, and
they have been ordered from red to blue (from left to right and from top to bottom) following Fig. 6. This order breaks down with the
abnormal classes 19, 20, 21 and 17, which are shown at the end of the sequence. Classes 22 and 24 are not included, since they correspond
to failures in the reduction pipeline and are similar to class 17. The panels also include the intra-class standard deviation, which quantifies
the dispersion among the spectra included in the class (the dotted lines). Wavelengths are given in μm.
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Fig. 5. — Histogram with the number of spectra in each class 
as derived from the k-means classification of stellar spectra that 
includes continuum. The class number has been assigned according 
to the number of stars in the class, with class 0 being the class with the
largest number of elements.
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Fig. 6. — Composite image with the templates spectra of all 
the classes in the k-means classification of spectra with continuum. 
They have been ordered so that the image looks smooth. All spec- 
tra are normalized to their maximum intensities. The color palette 
tries to mimic the human eye sensitivity. The classes that corre- 
spond to failures of the pipeline (classes 17, 22 and 24) and binary 
systems (classes 20 and 21) are excluded. 



the templates of classes 20 and 21 resemble spectra of
post-common envelope binaries, as identified and studied
using SDSS data (e.g., Rebassa-Mansergas et al. 2007;
Schreiber et al. 2008; Rebassa-Mansergas et al. 2008).
Some other classes are really awkward, and so difficult to
interpret unless they are associated with failures in the
reduction pipeline (e.g., classes 17 and 19).
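
The fit of a composite template as a linear superposition of two other class templates can be sketched as follows. This is only an illustration under stated assumptions: the function name `best_pair_fit` and the synthetic power-law "templates" are hypothetical, and rejecting fits with negative coefficients is a simplification rather than a proper non-negative least-squares solver. It is not the fitting code used for the figure.

```python
import itertools
import numpy as np

def best_pair_fit(target, templates):
    """Fit target ~ a*T_i + b*T_j for every pair (i, j) of templates.

    Returns (rms, i, j, coeffs) for the pair with the smallest rms residual,
    considering only fits where both coefficients are non-negative.
    """
    best = None
    for i, j in itertools.combinations(range(len(templates)), 2):
        basis = np.column_stack([templates[i], templates[j]])
        coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
        if np.any(coeffs < 0):  # a star cannot contribute negative light
            continue
        rms = np.sqrt(np.mean((basis @ coeffs - target) ** 2))
        if best is None or rms < best[0]:
            best = (rms, i, j, coeffs)
    return best

# Synthetic stand-ins (hypothetical): a blue, a red, and a flat "template".
wave = np.linspace(0.38, 0.92, 200)      # wavelength in micron
hot, cool, flat = wave ** -4.0, wave ** 3.0, np.ones_like(wave)
templates = [hot, cool, flat]
target = 0.7 * hot + 2.0 * cool          # composite with upturns at both ends
rms, i, j, coeffs = best_pair_fit(target, templates)
```

With real templates the residual does not vanish, and its shape shows where the two-component assumption breaks down.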

Figure 9 shows an image with the distribution of u - g
vs g - r colors of the full set, with the individual classes
overlaid as contours containing 68% of the stars. The
colors have been derived from the spectra using the
transmission bandpasses of the broad-band SDSS filters.
Note that the classification is basically a color
classification. Disregarding the classes gathering failures of the
pipeline (classes 17, 19, 22, and 24) and multiple
systems (classes 20 and 21), the k-means classification of
stellar spectra with continuum seems to separate stars
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Fig. 7. — Histograms of the ratio between probabilities of be- 
longing to the 2nd nearest cluster and to the assigned cluster for 
stars included in classes 0 to 3 (as the inset indicates). The
abscissae are by definition bounded between 0 and 1. Histograms peaking
at low ratio characterize well defined classes (e.g., class 0) whereas 
histograms with large counts at large ratios indicate a fuzzy class 
(e.g., class 3). 
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Fig. 8. — The class 21 template spectrum (the solid line), with 
upturns in the blue and the red, seems to be the composite spec- 
trum of a stellar system. Assuming it to be binary, the best fit 
(the dotted line) is obtained as a linear superposition of a WD 
(class 16) and a late type giant (class 18). The fit is not perfect; 
the difference between class 21 and the best fit is shown as the 
dashed line. 

according to their position on the color-color plot. The 
classes form a one-dimensional set in the diagram, with 
a bifurcation at g - r ~ 0.5. The bifurcation separates
dwarf stars (on top) from giant stars, a result further dis- 
cussed below. We note, in passing, that multiple systems 
are well separated in this color-color plot and, therefore, 
it can be used to select them. 

Main sequence stars have log(g) larger than 3.8
(with the surface gravity g in cm s^-2), which is even
larger (log(g) > 4) for stars with T_eff < 9000 K (e.g.,
Drilling & Landolt 2000). Figure 10 shows the
two-dimensional distribution of log(g) vs T_eff, together with
contours with the region containing 68% of the stars in
the class. The gravities and effective temperatures of
individual stars have been taken from the SSPP (param- 
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Fig. 9. — u — g vs g — r plot for the full set (the image in the 
background) and for the different classes separately. The contours 
show the region with 68% of the spectra in the class, and have 
been labeled with the corresponding class number. This plot
corresponds to the full spectrum classification. Classes 17, 19, 22, and
24 seem to be failures of the reduction pipeline. Classes 20 and 21
are binary or multiple systems with spectra blue in the blue filters
(i.e., u - g) and red in the red filters (i.e., g - r). The rest of the classes
form a 1D sequence that bifurcates at g - r ~ 0.5 (at classes 0 and 3).

eters labeled ADOP), as we explain in Sect. 2. The plot
does not include all the classes since many of them
overlap and would clutter the figure. Only those classes
relevant to our argument are included; in particular,
classes 0 and 3 have similar T_eff but belong to the two
different branches of the color sequence (see Fig. 9). Note
that class 0 gathers only main sequence stars (log(g) > 4)
whereas most class 3 targets are giants. Something
similar occurs with the pairs of classes 1 and 5, and 8 and
ff. Classes 0, 1, 4 and 8 contain only main sequence
stars (see classes 0 and 8 in Fig. 10). Classes 14 and 18
contain only giant stars (see class 18 in Fig. 10).
Several classes do not have enough valid T_eff and log(g) to
establish their location in the log(g) vs T_eff plot,
including the classes with faults plus classes 16 and 25. Class
25 is a minor class with few elements, but class 16 is
not. The lack of effective temperatures and gravities
for class 16 seems to be associated with the fact that
it collects WDs, for which no proper physical data are
provided by the SSPP (but see, e.g., Eisenstein et al.
2006; Kleinman 2010; Tremblay et al. 2011). We note
that classes 20 and 21, multiple systems whose spectra
combine hot and cold components (see Figs. 4 and 8),
appear in Fig. 10 as main sequence stars with T_eff between
5000 K and 6000 K. They are also metal rich systems
according to the plot discussed in the next paragraph.

Figure 11 shows the two-dimensional distribution of
[Fe/H] vs T_eff, together with contours with the region
containing 68% of the stars in the class. As with the rest
of the physical parameters of the stars, the metallicity [Fe/H]
comes from the SSPP. The figure shows how the classes
often contain both high and low metallicity stars. If the
threshold between low and high metallicities is set at one
tenth of the solar value (i.e., [Fe/H] = -1), the classes
that contain only high metallicity stars are 0, 1, 4, 8,
14, and 18. Similarly, classes 12 and 15 include only low
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Fig. 10. — log(g) vs T e ff for different classes resulting from the 
classification of stellar spectra with continuum. The contours
contain 68% of the stars in the class. Only classes 0, 3, 7, 8, 9,
15 and 18 are included to avoid cluttering the figure. The class 
numbers have been placed at the location of the mean of the cor- 
responding distribution. 
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Fig. 11. — [Fe/H] vs T e g for some representative classes result- 
ing from the classification of spectra including continuum. The
represented classes are the same as those in Fig. 10. The contours
embrace 68 % of the stars in the class, and the class numbers have 
been placed at the mean point of the corresponding distributions. 

metallicity stars. Some of these classes are included in
Fig. 11. Class 18 contains low gravity, low temperature,
high metallicity stars, probably K giants. Class 15
contains low gravity, high temperature, low metallicity stars.

The k-means classes do not exactly coincide with the
classical MK types assigned to the stars by Lee et al.
(2008b) (see Sect. 5). Figure 12 presents the histogram
of MK types corresponding to the k-means classes in
Figs. 10 and 11. As the histograms show, most classes
can be ascribed to a single MK type or to a narrow range
of them (e.g., classes 0, 8 and 9). However, the spread in
MK types is sometimes large (e.g., class 18), becoming
extreme in the bluest classes (e.g., class 15), which often
group A-type stars (mainly on the horizontal branch)
with WDs.

To recapitulate, the k-means classification of spectra 
Fig. 12. — Histograms of the distribution of MK types of selected k-means classes corresponding to spectra with continuum. (The classes
are those chosen for Figs. 10 and 11.) Splitting the histograms into two panels avoids overcrowding the figure. The MK types shown
here and in Fig. 1 are directly comparable. The histograms have been normalized to one, including objects for which the MK type is not
available.



with continuum seems to be basically a color
classification. The stellar colors are driven mostly by the overall
continuum shape; therefore, the k-means classes separate
stars according to their continua.

4.2. Classes with continuum removed 

As explained in the previous section, the k-means
classification of stellar spectra with continuum is essentially
a color classification. Since the colors are dictated mostly
by the continuum, the classification is driven by the
shape of the continuum. Dust extinction and errors
in the spectro-photometric calibration corrupt the continua
but affect the spectral lines much less, and the lines retain
the information on the stellar properties. In order to study the
potential of k-means to identify and separate spectra
according to subtle spectral-line differences, we repeated
the classification using spectra without continuum.
Explicitly, the spectra to be classified are the original
spectra after subtraction of a running mean computed with
a filter 193 pixels wide, which corresponds to 170 Å in the
blue and 400 Å in the red. The width of the numerical filter
was chosen as a trade-off: broader than most spectral features,
yet narrow enough to be representative of the local
continua. Because SDSS spectra are sampled in log
wavelength, our constant width in pixels represents a single
Doppler broadening of the order of 17,400 km s^-1.
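
The continuum removal described above amounts to subtracting a boxcar running mean. A minimal sketch follows, assuming a NumPy array of fluxes on the survey pixel grid. The function name `remove_continuum` is hypothetical, and setting the half-window at each end to zero is our assumption about how to handle the edges, where the running mean is not well defined.

```python
import numpy as np

def remove_continuum(flux, width=193):
    """Subtract a running mean of `width` pixels from a 1D spectrum."""
    flux = np.asarray(flux, dtype=float)
    kernel = np.ones(width) / width
    # mode="same" keeps the input length; the window is centered for odd width.
    continuum = np.convolve(flux, kernel, mode="same")
    residual = flux - continuum
    half = width // 2
    # The convolution zero-pads beyond the array, so the first and last
    # half-window are biased; blank them (assumption, see lead-in).
    residual[:half] = 0.0
    residual[-half:] = 0.0
    return residual

# Toy spectrum: linear continuum plus one narrow absorption line.
flux = np.linspace(1.0, 2.0, 3849)
flux[2000] -= 0.5
lines_only = remove_continuum(flux)
```

A linear continuum is removed exactly away from the edges, whereas a feature much narrower than the window survives almost intact, which is the trade-off the filter width is meant to balance.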

The procedure leading to the classification is similar
to that used for the full spectra described in
Sect. 4.1. The dispersion, coincidence, and number
of major classes were used to select one among 100
independent initializations. The selection criteria try to
make this classification as representative of the rest as
possible. The selected classification has coincidence = 75.7%,
dispersion = 0.051, 13 major classes, and 1 minor class.
Note that the coincidence is larger than that for the
classification with continuum, and the dispersion and
number of classes smaller. The number of stars in each class
is shown in Fig. 13. As we did for the classes resulting
from classifying the spectra with continuum (Sect. 4.1),
the new classes are also named class 0, class 1, and so on,
with the number increasing as the number of elements in the
class decreases. The average spectra of all the stars in each
class are shown in Fig. 14. We just show a small portion
of the blue spectrum where individual spectral lines can
be appreciated. (Two spectra spanning the full spectral
range are shown for illustration, but then it is
impossible to appreciate details of the lines.) Note that each
template spectrum is the average of thousands of
individual spectra, so all the small noise-looking wiggles are
real spectral features. The full templates are shown as a
stack-plot image in Fig. 15 (cf. Fig. 6). They have been
ordered so that the sequence looks as smooth as
possible. The Balmer lines, which are the only ones present in
class 12, decrease in strength as one moves up in the
image. The conspicuous molecular bands of TiO are present
only in class 7. Figure 16 also shows the average
spectra of the classes, but this average was computed using
the original spectra with their continua intact. The
comparison of these spectra with those corresponding to the
classification including continuum (Fig. 4) renders a few




Fig. 13. — Histogram with the number of spectra in each class for 
the classification of spectra without continuum. The class number 
has been assigned according to the number of stars in the class,
with class 0 being the class with the largest number of members.

differences. First, the faulty classes with a fake emission
peak in the blue (e.g., class 17 in Fig. 4) have
disappeared. This is a side-effect of removing the continuum
from the spectra, which in our implementation blacks out
the 193 pixels in the extreme wavelengths, thus
removing the problem. Second, there is a new class that shows
spectra with emission lines (class 9), which most
probably are not real but poorly corrected sky lines at the
wavelengths of the Ca II IR triplet (see, e.g., Lee et al.
2008b). Finally, the spectra corresponding to binary
systems do not form a separate class, so that they have to
show up as outliers of the classification (Sect. 5).

The classes are shown in the u - g vs g - r color plot
in Fig. 17. They overlap more than the classes inferred
when the continuum is included; see Fig. 9. There are
several conclusions to be drawn from the comparison of
these two figures. The continua influence the
classification or, in other words, the classifications with and
without continua do not fully agree. However, most
classes can be viewed as mergers of classes with
continuum. Even if the continuum is not included for
classifying, the different classes have different colors; classes
0 and 9 represent an exception since they overlap in the
color-color plot (see Fig. 17). The color-color plot also
shows two parallel sequences that split at g - r ~ 0.2
(or at class 6). As happens with the classification
including continuum, the upper branch corresponds to
main sequence stars whereas the lower branch includes
giants. This separation by stellar size is clearer in
Fig. 18, which shows the 2D distribution of log(g) vs T_eff
for the stars in a number of selected classes: the classes
in the lower branch of Fig. 17 include low gravity stars
in Fig. 18 (see classes 6 and 2). Figure 19 shows the
2D distribution of [Fe/H] vs T_eff for the full set of stars,
together with contours with the region containing 68%
of the stars in selected classes. Note how well separated
the classes are in this plot, in contrast with the overlap
present in the color-color plot (Fig. 17). The behavior is
opposite to that of the classes resulting from classifying
spectra with continua, which are well separated in the
color-color plot (Fig. 9), but overlap in the [Fe/H]-T_eff
diagram (Fig. 11). Note also that classes 5 and 6
occupy the same region of the color-color plot and overlap
in the log(g)-T_eff plot, but they have different
metallicities. These are F stars that approximately split
according to their membership in the thick disk and the halo
(Allende Prieto et al. 2006); therefore, the classification
provides a quick-look tool to separate disk and halo stars.
Class 12 is dominated by DA WDs (see below and the
class template in Fig. 15); however, it shows up as
extremely low metallicity in Fig. 19. Since the SSPP does
not deal with WDs, these must have been confused with
A-type stars and analyzed as such, finding that they are
best matched with no metals.

Figure 20 shows the distribution of MK spectral types
corresponding to the k-means classes. Note how each
class tends to belong to a single spectral type, but not
always. Moreover, the correspondence seems to be
better than that for the classification including continuum
(cf. Fig. 12). Note how class 12 is basically formed by
WDs, whereas classes 8, 10 and 11 are made of type
A stars. The most numerous class 0 contains almost
exclusively F9 stars, as also happens with the
classification with continuum (Fig. 12). We think that the
concentration of class 0 around a particular type is real,
but the particular type is not, since most stars selected
by SEGUE are G-type rather than F-type (see Sect. 5).
There seems to be a problem with the MK typing based
on ELODIE templates because, as expected, the
Hammer MK types associated with class 0 are late G types.
Figure 21 is equivalent to Fig. 20 but showing Hammer
types (Sect. 5). Class 0 corresponds to types between
G6 and K2. Note, in passing, that the classes of hot
stars have disappeared from the histograms since
Hammer typing does not allocate classes to them.

Table 1 gives a summary of the physical properties of
the classes; namely, it contains mean values and standard
deviations for colors, temperatures, gravities and
metallicities. (Class 13 has been excluded since it seems to be
collecting faulty spectra.) Note that some classes present
a fairly small range of physical parameters. For instance,
if a star is assigned to class 0 then we know its
temperature, gravity and metallicity with standard deviations of
190 K, 0.25 dex, and 0.36 dex, respectively. These
uncertainties are comparable to those associated with other
approaches currently used to estimate effective
temperature and gravity. This fact opens up the possibility of
using k-means for a quick-look estimate of the main
physical parameters of the stars at a minimum computational
cost. Once the classes are known, assigning a new
spectrum to one of them is virtually instantaneous: it is just
a matter of computing the difference between the new
spectrum and the class templates, and then selecting the
class of least rms deviation.
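
The quick-look assignment just described can be sketched in a few lines. The function name `assign_class` and the layout of `CLASS_PROPERTIES` are hypothetical. The numbers in the dictionary are the class 0 and class 7 rows of Table 1, and the spectrum is assumed to be continuum-subtracted and sampled on the same wavelength grid as the templates.

```python
import numpy as np

# (mean, standard deviation) copied from Table 1; only two classes shown.
CLASS_PROPERTIES = {
    0: {"teff": (5370, 190), "logg": (4.51, 0.25), "feh": (-0.53, 0.36)},
    7: {"teff": (4290, 150), "logg": (4.39, 0.34), "feh": (-0.55, 0.42)},
}

def assign_class(spectrum, templates):
    """Return (index, rms) of the template with the least rms deviation.

    `templates` has shape (n_classes, n_wavelengths); `spectrum` has
    shape (n_wavelengths,).
    """
    templates = np.asarray(templates, dtype=float)
    spectrum = np.asarray(spectrum, dtype=float)
    rms = np.sqrt(np.mean((templates - spectrum) ** 2, axis=1))
    best = int(np.argmin(rms))
    return best, float(rms[best])
```

Given the index returned by `assign_class`, the entry `CLASS_PROPERTIES[index]` provides the estimate and its dispersion. The cost is one pass over the templates, which is why the assignment is essentially instantaneous once the classes exist.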

5. OUTLIERS 

Having a classification automatically provides outliers,
i.e., uncommon objects which therefore do not belong to
any of the classes. We can easily identify them since, in
addition to assigning class memberships, our algorithm
estimates the probability of belonging to the class (see
Sect. 3). Outliers are therefore those objects whose
probability of belonging to their class is below a threshold.
We set the threshold to 0.01, which implies selecting as 
outliers the 1 % spectra furthest from their clusters cen- 
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Fig. 14. — Template spectra of all the classes in the classification with the continuum removed. We just show a small portion in the 
blue range, otherwise it is impossible to appreciate individual spectral lines. The exception is given by the two panels at the lower right
corner, where classes 2 and 6 are repeated in their full spectral range. The class numbers are given as insets, and some of the main spectral
features in the region are also labelled. The classes have been ordered following Fig. 15 (from left to right and from top to bottom).




TABLE 1 

Properties of the stellar classes from spectra without 
continuum 
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Fig. 15. — Template spectra of all the classes corresponding to 
the classification without continuum. They have been ordered so 
that the image looks smooth. Class 13 on top seems to collect 
spectra with instrumental problems. 



ters. The actual threshold is both arbitrary and unim- 
portant, since our purpose was figuring out the type of 



Class   g-r          u-g         T_eff (a)   log(g) (b)   [Fe/H] (c)
0       0.64±0.08    1.39±0.20   5370±190    4.51±0.25    -0.53±0.36
1       0.53±0.10    1.03±0.17   5660±290    4.13±0.47    -0.79±0.40
2       0.60±0.13    0.96±0.21   5450±340    3.33±0.75    -1.71±0.53
3       0.87±0.13    1.86±0.29   4830±230    4.61±0.22    -0.52±0.40
4       1.02±0.25    1.70±0.44   5260±440    3.74±0.74    -0.55±0.50
5       0.36±0.12    0.74±0.13   6400±300    3.98±0.35    -0.79±0.45
6       0.30±0.20    0.64±0.23   6500±330    3.74±0.47    -1.86±0.61
7       1.31±0.17    2.24±0.33   4290±150    4.39±0.34    -0.55±0.42
8       0.18±0.14    0.64±0.17   7250±340    4.05±0.37    -0.90±0.62
9       0.69±0.16    1.31±0.35   5340±330    4.24±0.47    -0.89±0.50
10      0.02±0.14    0.59±0.17   8020±340    4.12±0.40    -0.98±0.70
11      -0.14±0.08   0.49±0.12   8700±350    3.80±0.47    -1.36±0.66
12      -0.27±0.13   0.35±0.18   8890±410    4.67±0.17    -3.03±0.66

(a) Effective temperature in K.
(b) Gravity g in cm s^-2.
(c) Metallicity in logarithmic scale referred to the Sun.



spectra that do not fit in the classification. The adopted
threshold renders some 2200 spectra, which were inspected
individually.
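
The selection of the spectra furthest from their cluster centers can be sketched as below. This is an illustration only: the membership probability used in this work is defined in Sect. 3 and is not reproduced here, so the sketch assumes that a low probability corresponds to a large rms distance to the assigned template and applies a single global cut. The function name `flag_outliers` is hypothetical.

```python
import numpy as np

def flag_outliers(spectra, templates, labels, fraction=0.01):
    """Flag the `fraction` of spectra furthest from their assigned template.

    spectra:   (n_spectra, n_wavelengths)
    templates: (n_classes, n_wavelengths)
    labels:    (n_spectra,) integer class of each spectrum
    Returns a boolean mask, True for outliers.
    """
    spectra = np.asarray(spectra, dtype=float)
    templates = np.asarray(templates, dtype=float)
    # rms distance of every spectrum to the template of its own class
    dist = np.sqrt(np.mean((spectra - templates[labels]) ** 2, axis=1))
    cut = np.quantile(dist, 1.0 - fraction)
    return dist > cut
```

A per-class cut, or a cut on the probability itself, would select a somewhat different set. As noted in the text, the exact threshold is not critical, since the goal is only to find out which kinds of spectra do not fit the classification.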

We carried out the inspection for both classifications, 
Fig. 16. — Average spectra of the classes resulting from classifying the spectra without their continua. Even though the continuum was
removed for classification, it has been included in this average. Note the suspicious shape of class 13, and the emission line features of
class 9 at some 0.9 μm. The classes have been ordered following Fig. 15, as in Fig. 14.
Fig. 17. — u - g vs g - r plot for the reference set of stellar spectra
(background image) and for the different classes separately. The 
contours show the region with 68% of the spectra in the class, and 
the centroid of each distribution has been labeled with the class 
number it belongs to. This plot corresponds to the classification 
where the continuum was removed. 
Fig. 18. — log(g) vs T_eff for a number of selected classes resulting
from the classification of the stellar spectra with the continuum
removed. The image shows the full set, and the contours indicate the
moved. The image shows the full set, and the contours indicate the 
regions containing 68 % of the stars in the class. The class num- 
bers appear at the centroid of the distributions. The distribution 
of class 11 has two separated peaks at log(g) ~ 3.2 and 4.2, so that 
the corresponding label appears between them. 



the one including continuum and the one without con- 
Fig. 19. — [Fe/H] vs T_eff for the different classes resulting from
the classification of the SEGUE stars with their continua removed. 
The image displays the histogram of the full set, whereas the con- 
tours mark the regions containing 68 % of the stars in the class. 
Class numbers are located at the center of the corresponding dis- 
tribution. 

tinuum. In both cases the vast majority of the outliers 
are noisy spectra or failures of the SDSS pipeline (e.g., 
gaps, unsuccessful removal of telluric lines, mismatches 
between the red and the blue spectrograph arms, and so 
on). These problematic spectra represent three-quarters 
of the outliers for the classification using continuum, and 
a bit less when continuum is removed. (This difference is 
to be expected since subtracting the continuum automat- 
ically cancels many of the calibration problems.) Here 
we focus on the outliers of the classification with contin- 
uum, although they are qualitatively similar to those of 
the classification with the continuum subtracted. In this 
case we count 548 genuine outliers. They are illustrated
in Fig. 22 and described in the following list, ordered from
more to less common:

- Quasars (QSOs) at redshift between 2 and 4, so
that Lyα appears in the visible spectral range (e.g.,
Fig. 22a); we identified the observed lines using the
QSO spectrum template by Francis et al. (1991).
Most outliers are of this kind (some 320, or 60%
of the sample). They represent only 0.3% of the QSOs
in the latest SDSS QSO catalog (Schneider et al.
2010), but they tend to appear at the redshifts
where the SDSS identification algorithm has known
problems (redshifts 2.9 and 3.2). A bad redshift
assignment at these particular redshifts would
explain the presence of a large number of QSOs
contaminating our stellar sample.

- Broad Absorption Line (BAL) QSOs at the same
high redshifts (Fig. 22b). Those are believed to
be AGNs with very rapid outflows (of a few times
10^4 km s^-1) along the line of sight (Gibson et al.
2009). We get 33 of these objects, which represent
6.0% of the outliers, and 9.3% of the QSOs, a
percentage that seems to be normal for BAL QSOs
(e.g., Gibson et al. 2009).

- Composite spectra of blue and red stars combined
(e.g., Fig. 22c, where Hα shows up in emission).
They may be genuine binary systems with the two
stars gravitationally bound, or just two or more
stars that happen to be along our line of sight.
We do not know why these stars appear as outliers rather
than elements of existing classes (classes 20 or 21
in Fig. 4), but it may be due to having an excess of
blue upturn as compared to the template spectra.

- Flat spectra, showing absorption line features
characteristic of hot stars in the blue, and of cold stars
in the red (e.g., Fig. 22d). They may be composite
spectra like in Fig. 22c, but with the luminosities of
the stars fine-tuned so that the combination looks
spectrally flat.

- Extreme spectra. They look like the corresponding
templates, but seem to be extreme cases (e.g.,
extreme colors or particularly deep absorption lines).
Figure 22e shows a particularly cold star or
substellar object; the spectrum and the corresponding
template are shown as solid and dotted lines,
respectively. Figure 22f shows the spectrum of a
star (the solid line) hotter than its template (the
dotted line).

- Star-forming galaxies at intermediate redshifts
(e.g., Fig. 22g).

- QSOs at redshifts around one (e.g., Fig. 22h).

- Strongly dust-reddened blue stars (e.g., Fig. 22p).

- Carbon rich WDs with strong C2 bands (e.g.,
Figs. 22i and 22j); see the observed and synthetic
spectra in Wegner & Yackovich (1984).

- Carbon stars (e.g., Fig. 22o), where the
photospheric opacity is dominated by C-bearing
molecules. The carbon, dredged up to the
photosphere, comes from the He burning shell
characteristic of low mass stars during their late stages
of evolution (e.g., Aringer et al. 2009). Figure 22o
should be compared with C-star spectra in, e.g.,
Loidl et al. (2001, Fig. 4) and Aringer et al. (2009,
Fig. 3).

- Strange-looking spectra. They may be failures of
the reduction pipeline, but they may be genuine
abnormal objects as well. Figures 22k-22n show a
few of them, chosen only when they are not the sole
representative of their class. They include spectra
with strong emission lines (e.g., Fig. 22m), or
spectra with a single absorption line (e.g., Fig. 22n).

6. DISCUSSION AND CONCLUSIONS 

The traditional approach to classifying stellar spectra has
to be adapted to process the volume of data produced
by the large surveys underway (see Sect. 1). There is a need
for new automated techniques of analysis. In this work
we explore the use of the k-means algorithm for the task,
i.e., as a tool for the automated unsupervised
classification of massive stellar spectrum catalogs. The algorithm
has already proven its potential for the fast processing of other
astronomical spectra (Sect. 1), and we expected it to be
useful in this context as well.
Fig. 20. — Histograms of the distribution of ELODIE MK types of selected k-means classes corresponding to spectra without continuum
(see Fig. 15). The histograms have been split in two panels to avoid overcrowding. Each one shows a set of k-means classes as described in
the insets. The histograms have been normalized to one, including objects where the MK type is not available.
Fig. 21. — Similar to Fig. 20 but showing Hammer MK types rather than ELODIE types. The histograms have been normalized to one,
including objects where the type is not available. Classes 8, 10, 11 and 12 do not appear because they contain hot stars with no associated
Hammer type. Note that class 0 collects types G and early K, whereas it appears as formed by F9 stars in Fig. 20.
Fig. 22. — Examples of outliers of the classification that includes continuum. (a) QSO at redshift around 4, so that Lyα shows up at
some 6000 Å. (b) BAL QSO, with broad absorption lines blue-shifted with respect to their emission line counterparts. The largest signal
corresponds to Lyα. (c) Composite spectrum formed by combining light from red and blue stars. (d) Flat spectrum, showing signs of a
hot star in the blue and a red star in the red. They may be composite spectra like in (c) but with the magnitudes of the two stars tuned to
look spectrally flat. (e) Spectrum of an extremely cold star; the spectrum and the corresponding template are shown as solid and dotted
lines, respectively. (f) Star (the solid line) significantly hotter than its template (the dotted line). (g) Star-forming galaxy at intermediate
redshift. (h) QSO at redshift around one. (i-j) Carbon-rich WDs. (k-n) Strange-looking spectra. They may be failures of the reduction
pipeline, but they may be genuine abnormal objects as well. They include spectra with strong emission lines (m), or spectra with a single
absorption line (n). (o) Carbon stars. (p) Dust-reddened blue stars.
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For this exploratory application of k-means, we se- 
lected the data set of stellar spectra associated with the 
SEGUE and SEGUE-2 programs. Even though it is not
a fair sample of the Milky Way stellar populations, it 
contains a rich variety of stellar types and so it is a 
good test-bench for the classification algorithm. After 
discarding faulty cases, the reference data set consists of
173,390 stellar spectra from 3800 to 9200 Å sampled at
3849 wavelengths. Therefore, the problem for k-means is
to find clusters among 173,390 vectors defined in a
3849-dimensional space. The full data set occupies 6.1 GB,
and is classified using a standard up-to-date workstation
in some two hours (see Sect. 3.1).
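
For orientation, the clustering problem stated above can be illustrated with a bare-bones k-means iteration. This sketch is plain Lloyd's algorithm on a small array, not the implementation used in this work: it omits the membership probabilities, the repeated random initializations, and the selection among them. The function name `kmeans` and its parameters are hypothetical.

```python
import numpy as np

def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd iteration; returns (labels, centers).

    data: (n_vectors, n_dimensions) array, e.g., one spectrum per row.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct vectors drawn at random from the data.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance of every vector to every center. This builds an
        # (n, k, d) array, so a catalog-sized input would need chunking.
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = data[labels == c]
            if len(members):            # keep the old center if a class empties
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                        # assignments no longer change
        centers = new_centers
    return labels, centers
```

Each row of `centers` plays the role of a class template, i.e., the average of the spectra assigned to that class.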

We apply the classification to the original spectra and 
also to the spectra with the continuum removed. The lat- 
ter data set contains only spectral lines, and it is less de- 
pendent on observational and instrumental problems like 
dust extinction, spectro-photometric mis-calibrations, or 
failures in the reduction pipeline. 

The classification of the spectra with continuum
renders 16 major classes, with 99% of the objects, and
ten minor classes with the remaining 1% (Sect. 4.1).
Roughly speaking, the stars are split according to their
colors, with enough finesse to distinguish dwarf and
giant stars (Fig. 9). Figure 4 shows the template
spectra representative of all the classes: there are classes
for WDs (class 15), A-type stars (class 9), F-type stars
(class 0), K-type stars (class 8), M-type stars (class 23),
dust-reddened intrinsically blue stars (class 18), binary
systems (class 21), and even classes with faulty
spectra (classes 17 and 19). It must be stressed, however,
that there is no one-to-one correspondence between
the classes we derived and the MK types. Often our
classes mix several MK types, and vice versa. The
classification is able to separate stars with similar
temperatures but different surface gravities (compare classes
0 and 3 in Fig. 10), but has difficulty separating stars
with different metallicities (Fig. 11).

The classification of spectra without continuum
renders fewer classes (Sect. 4.2): 13 major classes and only 1
minor class that probably collects faults of the reduction
pipeline (Figs. 14 and 16). In this case the color
separation is not as sharp as it is for the classification with
continuum included (cf. Figs. 9 and 17). However, it is
able to separate stars in classes with the same effective
temperatures but different metallicities (Fig. 19). The
behavior is opposite to that of the classes resulting from
classifying spectra with continua, which are well
separated in the color-color plot but overlap in [Fe/H] vs T_eff.

Some classes include stars with a fairly small range of
physical parameters, as assigned by the SSPP. The mean
value and dispersion of the effective temperature, surface
gravity, and metallicity of the classes without continuum
are listed in Table 1. A small dispersion implies that our
classification can be used to estimate the main physi- 
cal parameters of the stars at a minimum computational 
cost. One only has to assign the problem spectrum to 
one of the existing classes, e.g., to the one of minimum 
residual. Then the properties of the class can be passed 
on to the new spectrum, thus providing its main physical 
properties. For example, if the problem spectrum happens
to belong to class 0, then we know its temperature,
gravity, and metallicity within a standard deviation of
190 K, 0.25 dex, and 0.36 dex, respectively. These
uncertainties are probably upper limits since the estimate of
the physical parameters by the SSPP has non-negligible 
internal errors that are included in the dispersions. Note 
that the uncertainties are comparable with those asso- 
ciated with other approaches currently used to estimate 
effective temperature and gravity, which are far more 
time consuming. Moreover, since we derive physical pa- 
rameters from spectra without continuum, the estimates 
are fairly robust against dust reddening and other obser- 
vational issues, which are often a serious problem when 
dealing with stars at low galactic latitudes. 
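
The assignment by minimum residual described above can be sketched in a few lines. The example below is only illustrative and rests on stated assumptions: the residual is taken to be the sum of squared differences on a common wavelength grid, the function names `residual` and `assign_class` are hypothetical, and both the toy templates and the per-class temperatures are placeholder numbers, not the published templates or tabulated values.

```python
from typing import Dict, Sequence, Tuple


def residual(spectrum: Sequence[float], template: Sequence[float]) -> float:
    """Sum of squared differences between a spectrum and a class template."""
    if len(spectrum) != len(template):
        raise ValueError("spectrum and template must share the wavelength grid")
    return sum((s - t) ** 2 for s, t in zip(spectrum, template))


def assign_class(
    spectrum: Sequence[float], templates: Dict[int, Sequence[float]]
) -> Tuple[int, float]:
    """Return (class label, residual) for the template of minimum residual."""
    label = min(templates, key=lambda k: residual(spectrum, templates[k]))
    return label, residual(spectrum, templates[label])


# Toy 4-pixel templates (made-up numbers, not the published templates).
templates = {0: [1.0, 0.8, 0.9, 1.0], 3: [1.0, 0.5, 0.6, 1.0]}
# Hypothetical per-class (mean, dispersion) of T_eff in K; placeholders only.
class_teff = {0: (6000.0, 190.0), 3: (5000.0, 200.0)}

problem = [1.0, 0.75, 0.85, 1.0]
label, res = assign_class(problem, templates)
teff_mean, teff_sigma = class_teff[label]
print(f"class {label}: T_eff ~ {teff_mean:.0f} +/- {teff_sigma:.0f} K")
```

The cost is one pass over the templates per spectrum, which is why the estimate is cheap compared with a full parameter fit.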

The classification also provides a means of finding rare 
but scientifically interesting objects, e.g., unusually low 
metallicity stars, odd spectral types, etc. By definition, 
rare objects must be outliers of any classification, oth- 
erwise they would be common and would have classes 
associated with them. Our rendering of k-means gives the
goodness of the assignment, i.e., the probability that each
star belongs to the class it has been assigned to. Therefore,
the outliers of the classification are easy to pinpoint
as those spectra whose probability of a correct assignment
is low enough. The nature of the outliers thus selected
was examined in Sect. 5 (see Fig. 35). Most outliers
are faulty data or failures of the SDSS reduction
pipeline. The remaining 25% consists, first, of high-redshift
QSOs. Since they are in the appropriate redshift
range, we speculate that these QSOs may be those lost to
a known problem in the SDSS QSO identification algorithm.
There is also a large number of outliers corresponding
to composite spectra formed by either real or fake double
or multiple stellar systems, whose spectrum is that of
a hot star in the blue and a cool star in the red. There
are also reddened stellar spectra and galaxy spectra. Finally,
there are odd spectral types whose nature we did
not manage to figure out, and which we plan to observe
in follow-up work.
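
The selection of outliers can be illustrated with a simple stand-in. The sketch below does not reproduce the assignment probability used in this work; as an explicitly different and simpler criterion, it ranks the spectra by the residual to their assigned template and flags a chosen fraction with the largest residuals. The function name `flag_outliers`, the `fraction` parameter, and the residual values are all hypothetical.

```python
from typing import List, Sequence


def flag_outliers(residuals: Sequence[float], fraction: float = 0.01) -> List[int]:
    """Indices of the worst-assigned spectra, largest residual first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always flag at least one spectrum, even for very small samples.
    n_flag = max(1, int(fraction * len(residuals)))
    order = sorted(range(len(residuals)), key=lambda i: residuals[i], reverse=True)
    return order[:n_flag]


# Made-up residuals for eight spectra; two of them fit their class poorly.
residuals = [0.1, 0.2, 5.0, 0.15, 0.12, 3.0, 0.11, 0.3]
worst = flag_outliers(residuals, fraction=0.25)
print(worst)
```

The flagged indices would then be inspected individually, since the list mixes faulty data with genuinely rare objects.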

One obvious use of the present classification is identi- 
fying spectra having instrumental problems or being pro- 
duced by flaws of the reduction pipeline. We find classes 
containing faulty spectra when the problem is common, 
and then we find faulty spectra as outliers of the classi- 
fication when the problem is unusual. 

Stellar spectra are known to be highly compressible so 
that they can be characterized using only a few independent
parameters (see Sect. 1). The fact that the
classes present a regular behavior was therefore somewhat expected,
and it is not the main outcome of our exercise. Instead,
our exploratory work shows k-means to provide a
viable tool for the systematic classification of large data 
sets of stellar spectra. Moreover, there is plenty of room 
for improving the procedure, i.e., for upgrades that have 
not been considered in the paper, but which may be of 
interest in future uses. One can focus the classification on
a particular spectral range (or set of ranges) particularly 
sensitive to the physical parameter one wants to select 
(say, the metallicity if searching for classes of extremely 
metal poor stars). Then the resulting classes would em- 
phasize this particular aspect of the spectra. Obviously, 
using smaller spectral ranges for classification also speeds 
up the procedure. One can also resort to nested k-means 
classifications, where the spectra of a given class are separated
into subclasses. This can be used to fine-tune the
separation.
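
The two upgrades just mentioned, restricting the classification to a spectral range and nesting classifications, can be sketched as follows. This is a minimal toy version under stated assumptions, not the implementation used in this paper: it runs plain Lloyd iterations with a deterministic farthest-point seeding chosen only to make the example reproducible, and the names `kmeans`, `restrict_range`, and `nested_kmeans`, the wavelengths, and the two-pixel "spectra" are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two spectra on the same grid."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(data: Sequence[Vector], k: int) -> List[List[float]]:
    """Deterministic farthest-point seeding (an assumption of this sketch)."""
    centers = [list(data[0])]
    while len(centers) < k:
        far = max(data, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(far))
    return centers


def kmeans(data: Sequence[Vector], k: int, n_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd iterations: assign to the nearest center, then recompute means."""
    centers = init_centers(data, k)
    labels: List[int] = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def restrict_range(data: Sequence[Vector], wavelengths: Sequence[float],
                   lo: float, hi: float) -> List[List[float]]:
    """Keep only the pixels whose wavelength lies in [lo, hi]."""
    keep = [i for i, w in enumerate(wavelengths) if lo <= w <= hi]
    return [[v[i] for i in keep] for v in data]


def nested_kmeans(data: Sequence[Vector], labels: Sequence[int],
                  parent: int, k: int) -> Dict[int, int]:
    """Split the members of class `parent` into k subclasses."""
    idx = [i for i, lab in enumerate(labels) if lab == parent]
    sub_labels, _ = kmeans([data[i] for i in idx], k)
    return dict(zip(idx, sub_labels))


# Toy two-pixel "spectra": pixel 0 dominates the variance, pixel 1 is subtler.
wavelengths = [4000.0, 5000.0]
spectra = [[0.0, 0.0], [0.2, 0.0], [0.0, 2.0], [0.2, 2.0],
           [10.0, 0.0], [10.2, 0.0], [10.0, 2.0], [10.2, 2.0]]

labels, _ = kmeans(spectra, 2)                       # split driven by pixel 0
sub = nested_kmeans(spectra, labels, parent=0, k=2)  # subclasses of class 0
line_labels, _ = kmeans(restrict_range(spectra, wavelengths, 4500.0, 5500.0), 2)
print(labels, sub, line_labels)
```

In this toy case the full classification separates the spectra by the dominant pixel, the nested run recovers the subtler difference within one class, and the restricted run classifies by the selected pixel alone.
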
Finally, we note that the template spectra
from the classifications with and without continuum are
publicly available.